Abstract

We prove rates of convergence in the statistical sense for kernel-based least squares 
regression using a conjugate gradient algorithm, where regularization against overfit- 
ting is obtained by early stopping. This method is directly related to Kernel Partial 
Least Squares, a regression method that combines supervised dimensionality reduction 
with least squares projection. The rates depend on two key quantities: first, on the 
regularity of the target regression function and second, on the intrinsic dimensionality 
of the data mapped into the kernel space. Lower bounds on attainable rates depend- 
ing on these two quantities were established in earlier literature, and we obtain upper 
bounds for the considered method that match these lower bounds (up to a log factor) 
if the true regression function belongs to the reproducing kernel Hilbert space. If this 
assumption is not fulfilled, we obtain similar convergence rates provided additional 
unlabeled data are available. The order of the learning rates matches state-of-the-art
results that were recently obtained for least squares support vector machines and for 
linear regularization operators. 

1 Introduction 

The contribution of this paper is the learning theoretical analysis of kernel-based least
squares regression in combination with conjugate gradient techniques. The goal is to estimate
a regression function $f^*$ based on random noisy observations. We have an i.i.d. sample
of $n$ observations $(X_i, Y_i) \in \mathcal{X} \times \mathbb{R}$ from an unknown distribution $P(X, Y)$ that follows the
model

\[ Y = f^*(X) + \varepsilon, \]

where $\varepsilon$ is a noise variable whose distribution can possibly depend on $X$, but satisfies
$\mathbb{E}[\varepsilon | X] = 0$. We assume that the true regression function $f^*$ belongs to the space $\mathcal{L}^2(P_X)$
of square-integrable functions. Following the kernelization principle, we implicitly map the
data into a reproducing kernel Hilbert space $\mathcal{H}$ with a kernel $k$. We denote by
$K_n = \frac{1}{n}(k(X_i, X_j)) \in \mathbb{R}^{n \times n}$ the normalized kernel matrix and by $Y = (Y_1, \ldots, Y_n)^T \in \mathbb{R}^n$ the
$n$-vector of response observations. The task is to find coefficients $\alpha$ such that the function
defined by the normalized kernel expansion

\[ f_\alpha(X) = \frac{1}{n} \sum_{i=1}^n \alpha_i k(X_i, X) \]

is an adequate estimator of the true regression function $f^*$. The closeness of the estimator
$f_\alpha$ to the target $f^*$ is measured via the $\mathcal{L}^2(P_X)$ distance,

\[ \|f_\alpha - f^*\|_2^2 = \mathbb{E}_{X \sim P_X}\left[(f_\alpha(X) - f^*(X))^2\right]
   = \mathbb{E}_{XY}\left[(f_\alpha(X) - Y)^2\right] - \mathbb{E}_{XY}\left[(f^*(X) - Y)^2\right]. \]

The last equality recalls that this criterion is the same as the excess generalization error for
the squared error loss $\ell(f, x, y) = (f(x) - y)^2$.

In empirical risk minimization, we use the training data empirical distribution as a proxy 
for the generating distribution, and minimize the training squared error. This gives rise to 
the linear equation 

\[ K_n \alpha = Y \quad \text{with } \alpha \in \mathbb{R}^n. \tag{1} \]

Assuming $K_n$ invertible, the solution of the above equation is given by $\alpha = K_n^{-1} Y$, which
yields a function in $\mathcal{H}$ interpolating perfectly the training data but having poor generalization
error. It is well known that to avoid overfitting, some form of regularization is needed.
There is a considerable variety of possible approaches (see e.g. [12] for an overview). Perhaps
the most well-known one is

\[ \alpha = (K_n + \lambda I)^{-1} Y, \tag{2} \]

known alternatively as kernel ridge regression, least squares support vector machine, or
Tikhonov's regularization. A powerful generalization of this is to consider

\[ \alpha = F_\lambda(K_n) Y, \tag{3} \]

where $F_\lambda : \mathbb{R}_+ \to \mathbb{R}_+$ is a fixed function depending on a parameter $\lambda$. The notation $F_\lambda(K_n)$
is to be interpreted as $F_\lambda$ applied to each eigenvalue of $K_n$ in its eigendecomposition.
Intuitively, $F_\lambda$ should be a "regularized" version of the inverse function $F(x) = x^{-1}$. This
type of regularization, which we refer to as linear regularization methods, is directly inspired
by the theory of inverse problems. Popular examples include as particular cases kernel
ridge regression, principal components regression and $L^2$-boosting. Their application in
a learning context has been studied extensively [2] [3] El [7] HI]. Results obtained in this
framework will serve as a comparison yardstick in the sequel.
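
As a rough illustration of the linear regularization scheme (3), a filter can be applied to $K_n$ through its eigendecomposition. The sketch below is not taken from the paper: the function names and the two example filters (the ridge filter and a spectral cut-off in the spirit of principal components regression) are assumptions chosen for illustration.

```python
import numpy as np

def spectral_filter_coefficients(K, Y, filter_fn):
    """Return alpha = F(K) Y, with F applied to the eigenvalues of the symmetric matrix K."""
    eigvals, eigvecs = np.linalg.eigh(K)            # K = V diag(eigvals) V^T
    return eigvecs @ (filter_fn(eigvals) * (eigvecs.T @ Y))

def ridge_filter(lam):
    """Tikhonov filter F(x) = 1 / (x + lam), which reproduces equation (2)."""
    return lambda x: 1.0 / (x + lam)

def cutoff_filter(lam):
    """Hypothetical spectral cut-off: invert eigenvalues >= lam, discard the others."""
    return lambda x: np.where(x >= lam, 1.0 / np.maximum(x, lam), 0.0)
```

With `ridge_filter`, the result should coincide with solving $(K_n + \lambda I)\alpha = Y$ directly; the point of the spectral form is only to make the role of $F_\lambda$ explicit.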

In this paper, we study conjugate gradient (CG) techniques in combination with early
stopping for the regularization of the kernel-based learning problem (1). The principle of CG
techniques is to restrict the learning problem onto a nested set of data-dependent subspaces,
the so-called Krylov subspaces, defined as

\[ \mathcal{K}_m(Y, K_n) = \mathrm{span}\{ Y, K_n Y, \ldots, K_n^{m-1} Y \}. \tag{4} \]

Denote by $\langle \cdot, \cdot \rangle$ the usual Euclidean scalar product on $\mathbb{R}^n$ rescaled by the factor $n^{-1}$. We
define the $K_n$-norm as $\|\alpha\|_{K_n}^2 := \langle \alpha, K_n \alpha \rangle$. The CG solution after $m$ iterations is formally
defined as

\[ \alpha_m = \arg\min_{\alpha \in \mathcal{K}_m(Y, K_n)} \| Y - K_n \alpha \|_{K_n}, \tag{5} \]

and the number $m$ of CG iterations is the model parameter. To simplify notation we define
$f_m := f_{\alpha_m}$. In the learning context considered here, regularization corresponds to early
stopping. Conjugate gradients have the appealing property that the optimization criterion
can be computed by a simple iterative algorithm that constructs basis vectors $d_1, \ldots, d_m$
of $\mathcal{K}_m(Y, K_n)$ by using only forward multiplication of vectors by the matrix $K_n$. Algorithm
1 displays the computation of the CG kernel coefficients $\alpha_m$ defined by (5).

Algorithm 1 Kernel Conjugate Gradient regression 

Input kernel matrix $K_n$, response vector $Y$, maximum number of iterations $m$
Initialization: $\alpha_0 = 0_n$; $r_1 = Y$; $d_1 = Y$
for $i = 1, \ldots, m$ do

  $d_i = d_i / \|K_n d_i\|_{K_n}$ (normalization of the basis vector)
  $\gamma_i = \langle Y, K_n d_i \rangle_{K_n}$ (step size)
  $\alpha_i = \alpha_{i-1} + \gamma_i d_i$ (update)
  $r_{i+1} = r_i - \gamma_i K_n d_i$ (residuals)

d i+1 = r i+ i - (K n di,r i+1 ) Kn /\\r i+1 \\ 2 Kn (new basis vector) 
end for 

Return: CG kernel coefficients $\alpha_m$, CG function $f_m = \sum_{i=1}^n \alpha_{i,m} k(X_i, \cdot)$
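
The iteration can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `kernel_cg` is hypothetical, and the new-direction update is an assumption derived from requiring $K_n d_{i+1}$ to be orthogonal to $K_n d_i$ in the $K_n$ inner product, since the corresponding line of the algorithm is not fully legible in this copy.

```python
import numpy as np

def kernel_cg(K, Y, m):
    """Sketch of kernel CG: returns the list [alpha_1, ..., alpha_m] of coefficient vectors.

    K is the normalized kernel matrix (symmetric PSD), Y the response vector.
    """
    Y = np.asarray(Y, dtype=float)
    n = len(Y)

    def inner(a, b):
        # <a, b>_{K_n} = <a, K_n b>, with the Euclidean product rescaled by 1/n
        return a @ (K @ b) / n

    alpha = np.zeros(n)
    r = Y.copy()
    d = Y.copy()
    alphas = []
    for _ in range(m):
        Kd = K @ d
        norm = np.sqrt(max(inner(Kd, Kd), 0.0))
        if norm < 1e-14:              # Krylov space exhausted; stop early
            break
        d, Kd = d / norm, Kd / norm   # normalization: ||K d||_{K_n} = 1
        gamma = inner(Y, Kd)          # step size
        alpha = alpha + gamma * d     # update
        r = r - gamma * Kd            # residual
        # Assumed new-direction rule: K d_{i+1} is K_n-orthogonal to K d_i
        d = r - inner(Kd, K @ r) * d
        alphas.append(alpha.copy())
    return alphas
```

Each iteration uses only products of vectors with `K`, and the early-stopping parameter is simply the length of the returned list.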



The CG approach is also inspired by the theory of inverse problems, but it is not covered
by the framework of linear operators defined in (3): as we restrict the learning problem
onto the Krylov space $\mathcal{K}_m(Y, K_n)$, the CG coefficients $\alpha_m$ are of the form $\alpha_m = q_m(K_n) Y$
with $q_m$ a polynomial of degree $\leq m - 1$. However, the polynomial $q_m$ is not fixed but
depends on $Y$ as well, making the CG method nonlinear in the sense that the coefficients
$\alpha_m$ depend on $Y$ in a nonlinear fashion.

We remark that in machine learning, conjugate gradient techniques are often used as
fast solvers for operator equations, e.g. to obtain the solution for the regularized equation
(2). We stress that in this paper, we study conjugate gradients as a regularization approach
for kernel-based learning, where the regularity is ensured via early stopping. This approach
is not new. As mentioned in the abstract, the algorithm that we study is closely related
to Kernel Partial Least Squares [21]. The latter method also restricts the learning problem
onto the Krylov subspace $\mathcal{K}_m(Y, K_n)$, but it minimizes the Euclidean distance $\|Y - K_n \alpha\|$
instead of the distance $\|Y - K_n \alpha\|_{K_n}$ defined above (see footnote 1). Kernel Partial Least Squares has shown
competitive performance in benchmark experiments (see e.g. [21], [22]). Moreover, a similar
conjugate gradient approach for non-definite kernels has been proposed and empirically
evaluated by Ong et al. [19]. The focus of the current paper is therefore not to stress the
usefulness of CG methods in practical applications (for which we refer to the above-mentioned
references) but to examine their theoretical convergence properties. In particular, we establish
the existence of early stopping rules that lead to optimal convergence rates. We summarize
our main results in the next section.



2 Main results 

For the presentation of our convergence results, we require suitable assumptions on the 
learning problem. We first assume that the kernel space H is separable and that the kernel 
function is measurable. (This assumption is satisfied for all practical situations that we 
know of.) Furthermore, for all results, we make the (relatively standard) assumption that 
the kernel is bounded: 

\[ k(x, x) \leq \kappa \quad \text{for all } x \in \mathcal{X}. \tag{6} \]

We consider, depending on the result, one of the following assumptions on the noise:

(Bounded) (Bounded $Y$): $|Y| \leq M$ almost surely.

(Bernstein) (Bernstein condition): $\mathbb{E}[\varepsilon^p | X] \leq (1/2)\, p!\, M^p$ almost surely, for all integers
$p \geq 2$.

The second assumption is weaker than the first. In particular, the first assumption implies
that not only the noise, but also the target function $f^*$ is bounded in supremum norm, while
the second assumption does not put any additional restriction on the target function.

The regularity of the target function $f^*$ is measured in terms of a source condition as
follows. The kernel integral operator is given by

\[ K : \mathcal{L}^2(P_X) \to \mathcal{L}^2(P_X), \quad g \mapsto \int k(\cdot, x) g(x) \, dP(x). \]

The source condition for the parameter $r > 0$ is defined by:

SC(r): $f^* = K^r u$ with $\|u\| \leq \kappa^{-r} \rho$.

It is a known fact (see, e.g., [10]) that if $r \geq 1/2$, then $f^*$ coincides almost surely with a
function belonging to $\mathcal{H}_k$. Therefore, we refer to $r \geq 1/2$ as the "inner case" and to $r < 1/2$
as the "outer case".

Footnote 1. This is generalized to a CG-$l$ algorithm ($l \in \mathbb{N}_{\geq 0}$) by replacing the $K_n$-norm in (5) with the norm
defined by $K_n^l$. Corresponding fast iterative algorithms to compute the solution exist for all $l$ (see e.g. [T5]).

The regularity of the kernel operator $K$ with respect to the marginal distribution $P_X$ is
measured in terms of the intrinsic dimensionality parameter, defined by the condition

ID(s): $\mathrm{tr}(K(K + \lambda I)^{-1}) \leq D^2 (\kappa^{-1} \lambda)^{-s}$ for all $\lambda \in (0, 1]$.

It is known that the best attainable rate of convergence, as a function of the number of
examples $n$, is determined by the parameters $r$ and $s$. It was shown in [TT] that the minimax
learning rate given these two parameters is lower bounded by $O(n^{-2r/(2r+s)})$.
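
To get a feeling for the quantity controlled by ID(s), one can compute its empirical analogue with the kernel matrix in place of the population operator. This substitution is an assumption made here for illustration only (the condition in the text concerns the population operator $K$), and the function name is hypothetical.

```python
import numpy as np

def effective_dimension(K, lam):
    """Empirical analogue of tr(K (K + lam I)^{-1}) for a symmetric PSD matrix K."""
    eigvals = np.clip(np.linalg.eigvalsh(K), 0.0, None)   # guard against tiny negative values
    return float(np.sum(eigvals / (eigvals + lam)))
```

Evaluating this for a decreasing sequence of values of `lam` shows how quickly the trace grows as the regularization vanishes, which is what the exponent $s$ quantifies.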

We now present our main results in different situations. In all the cases considered, the
early stopping rule takes the form of a so-called discrepancy stopping rule: for some
sequence of thresholds $\Lambda_m$ to be specified (and possibly depending on the data), define the
(data-dependent) stopping iteration $\hat{m}$ as the first iteration $m$ for which

\[ \| Y - (f_m(X_1), \ldots, f_m(X_n))^T \|_{K_n} < \Lambda_m. \tag{7} \]

(Only in the first result below, the threshold $\Lambda_m$ actually depends on the iteration $m$ and
on the data.)
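
A discrepancy stopping rule of this kind can be sketched as follows, given coefficient vectors $\alpha_1, \alpha_2, \ldots$ and a threshold for each iteration. The function names are hypothetical, and the sketch assumes the normalized kernel expansion, so that the vector of fitted values of $f_m$ at the sample points is $K_n \alpha_m$.

```python
import numpy as np

def k_norm(K, v):
    """||v||_{K_n} = sqrt(<v, K_n v>), with the Euclidean product rescaled by 1/n."""
    return float(np.sqrt(max(v @ (K @ v) / len(v), 0.0)))

def discrepancy_stop(K, Y, alphas, thresholds):
    """Return the first iteration m (1-based) whose residual norm is below its threshold.

    Returns None if no iteration in the supplied sequence satisfies the rule.
    """
    for m, (alpha, thr) in enumerate(zip(alphas, thresholds), start=1):
        fitted = K @ alpha                 # (f_m(X_1), ..., f_m(X_n))
        if k_norm(K, Y - fitted) < thr:
            return m
    return None
```

A constant list of thresholds corresponds to a fixed threshold, while an iteration-dependent list covers thresholds that change with $m$.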



2.1 Inner case without knowledge on intrinsic dimension 

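The inner case corresponds to $r \geq 1/2$, i.e. the target function $f^*$ lies in $\mathcal{H}$ almost surely.
For some constants $\tau > 1$ and $1 > \gamma > 0$, we consider the discrepancy stopping rule with
the threshold sequence

\[ \Lambda_m = 4\tau \sqrt{\frac{\kappa \log(2\gamma^{-1})}{n}} \left( \sqrt{\kappa}\, \|\alpha_m\|_{K_n} + M \sqrt{\log(2\gamma^{-1})} \right). \tag{8} \]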

For technical reasons, we consider a slight variation of the rule in that we stop at step $\hat{m} - 1$
instead of $\hat{m}$ if $q_{\hat{m}}(0) \geq 4\kappa \sqrt{\log(2\gamma^{-1})/n}$, where $q_m$ is the iteration polynomial such that
$\alpha_m = q_m(K_n) Y$. Denote by $\tilde{m}$ the resulting stopping step. We obtain the following result.

Theorem 2.1. Suppose that $Y$ is bounded (Bounded), and that the source condition SC(r)
holds for $r \geq 1/2$. With probability $1 - 2\gamma$, the estimator $f_{\tilde{m}}$ obtained by the (modified)
discrepancy stopping rule (8) satisfies

H/«-/lB<c(r,r)(M + p) 2 ^ 
We present the proof in Section 4.

2.2 Optimal rates in inner case



We now introduce a stopping rule yielding order-optimal convergence rates as a function of
the two parameters $r$ and $s$ in the "inner" case ($r \geq 1/2$, which is equivalent to saying that
the target function belongs to $\mathcal{H}$ almost surely). For some constant $\tau' > 3/2$ and $1 > \gamma > 0$,
we consider the discrepancy stopping rule with the fixed threshold

( 2r+l 
AD fi\ 2r + s 
-Hog- • ( 9 ) 

for which we obtain the following: 

Theorem 2.2. Suppose that the noise fulfills the Bernstein assumption (Bernstein), that
the source condition SC(r) holds for $r \geq 1/2$, and that ID(s) holds. With probability $1 - 3\gamma$,
the estimator obtained by the discrepancy stopping rule satisfies

2r 

Due to space limitations, the proof is presented in the supplementary material. 



2.3 Optimal rates in outer case, given additional unlabeled data 

We now turn to the "outer" case. In this case, we make the additional assumption that
unlabeled data is available. Assume that we have $\tilde{n}$ i.i.d. observations $X_1, \ldots, X_{\tilde{n}}$, out of which
only the first $n$ are labeled. We define a new response vector $\tilde{Y} = \frac{\tilde{n}}{n}(Y_1, \ldots, Y_n, 0, \ldots, 0) \in \mathbb{R}^{\tilde{n}}$
and run the CG Algorithm 1 on $X_1, \ldots, X_{\tilde{n}}$ and $\tilde{Y}$. We use the same threshold (9)
as in the previous section for the stopping rule, except that the factor $M$ is replaced by
$\max(M, \rho)$.

Theorem 2.3. Suppose assumptions (Bounded), SC(r) and ID(s), with $r + s \geq \frac{1}{2}$.
Assume unlabeled data is available with

(l-2r) + 

->( log 2 - 

n \ n 7/ 

Then with probability $1 - 3\gamma$, the estimator obtained by the discrepancy stopping rule
defined above satisfies

2r 

^log 2 ^) . 

n 7/ 

A sketch of the proof can be found in the supplementary material.

3 Discussion and comparison to other results



For the inner case, i.e. $f^* \in \mathcal{H}$ almost surely, we provide two different consistent stopping
criteria. The first one (Section 2.1) is oblivious to the intrinsic dimension parameter $s$, and
the obtained bound corresponds to the "worst case" with respect to this parameter (that is,
$s = 1$). However, an interesting feature of stopping rule (8) is that the rule itself does not
depend on the a priori knowledge of the regularity parameter $r$, while the achieved learning
rate does (and with the optimal dependence in $r$ when $s = 1$). Hence, Theorem 2.1 implies
that the obtained rule is automatically adaptive with respect to the regularity of the target
function. This contrasts with the results obtained in [2] for linear regularization schemes of
the form (3) (also in the case $s = 1$), for which the choice of the regularization parameter $\lambda$
leading to optimal learning rates required the knowledge of $r$ beforehand.

When taking into account also the intrinsic dimensionality parameter $s$, Theorem 2.2
provides the order-optimal convergence rate in the inner case (up to a log factor). A noticeable
difference to Theorem 2.1, however, is that the stopping rule is no longer adaptive, that
is, it depends on the a priori knowledge of parameters $r$ and $s$. We observe that previously
obtained results for linear regularization schemes of the form (2) in [7] and of the form (3) in
[6] also rely on the a priori knowledge of $r$ and $s$ to determine the appropriate regularization
parameter $\lambda$.

The outer case, when the target function does not lie in the reproducing kernel Hilbert
space $\mathcal{H}$, is more challenging and to some extent less well understood. The fact that
additional assumptions are made is not a particular artefact of CG methods, but also appears
in the studies of other regularization techniques. Here we follow the semi-supervised
approach that is proposed in e.g. [6] (to study linear regularization of the form (3)) and
assume that we have sufficient additional unlabeled data in order to ensure learning rates
that are optimal as a function of the number of labeled data. We remark that other forms
of additional requirements can be found in the recent literature in order to reach optimal
rates. For regularized M-estimation schemes studied in [23], availability of unlabeled data
is not required, but a condition is imposed of the form $\|f\|_\infty \leq C \|f\|_{\mathcal{H}}^p \|f\|_2^{1-p}$ for all $f \in \mathcal{H}$
and some $p \in (0, 1]$. In [T5], assumptions on the supremum norm of the eigenfunctions of
the kernel integral operator are made (see [23] for an in-depth discussion on this type of
assumptions).

Finally, as explained in the introduction, the term 'conjugate gradients' comprises a
class of methods that approximate the solution of linear equations on Krylov subspaces.
In the context of learning, our approach is most closely linked to Partial Least Squares
(PLS) [21] and its kernel extension [21]. While PLS has proven to be successful in a wide
range of applications and is considered one of the standard approaches in chemometrics,
there are only few studies of its theoretical properties. In [9], [16], consistency properties
are provided for linear PLS under the assumption that the target function $f^*$ depends
on a finite known number of orthogonal latent components. These findings were recently
extended to the nonlinear case and without the assumption of a latent components model
[I], but all results come without optimal rates of convergence. For the slightly different CG
approach studied by Ong et al. [19], bounds on the difference between the empirical risks of
the CG approximation and of the target function are derived in [18], but no bounds on the
generalization error were derived.

4 Proofs 

Convergence rates for regularization methods of the type (2) or (3) have been studied by
casting kernel learning methods into the framework of inverse problems (see [EE]). We use
this framework for the present results as well, and recapitulate here some important facts.
We first define the empirical evaluation operator $T_n$ as follows:

\[ T_n : g \in \mathcal{H} \mapsto T_n g := (g(X_1), \ldots, g(X_n))^T \in \mathbb{R}^n \]

and the empirical integral operator $T_n^*$ as:

\[ T_n^* : u = (u_1, \ldots, u_n) \in \mathbb{R}^n \mapsto T_n^* u := \frac{1}{n} \sum_{i=1}^n u_i k(X_i, \cdot) \in \mathcal{H}. \]

Using the reproducing property of the kernel, it can be readily checked that $T_n$ and $T_n^*$ are
adjoint operators, i.e. they satisfy $\langle T_n^* u, g \rangle_{\mathcal{H}} = \langle u, T_n g \rangle$ for all $u \in \mathbb{R}^n$, $g \in \mathcal{H}$. Furthermore,
$K_n = T_n T_n^*$, and therefore $\|\alpha\|_{K_n} = \|T_n^* \alpha\|_{\mathcal{H}} = \|f_\alpha\|_{\mathcal{H}}$. Based on these facts, equation (5) can be
rewritten as

\[ f_m = \arg\min_{f \in \mathcal{K}_m(T_n^* Y, S_n)} \| T_n^* Y - S_n f \|_{\mathcal{H}}, \tag{10} \]

where $S_n = T_n^* T_n$ is a self-adjoint operator of $\mathcal{H}$, called the empirical covariance operator. This
definition corresponds to that of the "usual" conjugate gradient algorithm formally applied
to the so-called normal equation (in $\mathcal{H}$)

\[ S_n f_\alpha = T_n^* Y, \]

which is obtained from (1) by left multiplication by $T_n^*$. The advantage of this reformulation
is that it can be interpreted as a "perturbation" of a population, noiseless version (of the
equation and of the algorithm), wherein $Y$ is replaced by the target function $f^*$ and the
empirical operators $T_n^*$, $T_n$ are respectively replaced by their population analogues, the kernel
integral operator

\[ T^* : g \in \mathcal{L}^2(P_X) \mapsto T^* g := \int k(\cdot, x) g(x) \, dP_X(x) = \mathbb{E}[k(X, \cdot) g(X)] \in \mathcal{H}, \]

and the change-of-space operator

\[ T : g \in \mathcal{H} \mapsto g \in \mathcal{L}^2(P_X). \]

The latter maps a function to itself but between two Hilbert spaces which differ with respect
to their geometry: the inner product of $\mathcal{H}$ is defined by the kernel function $k$, while the
inner product of $\mathcal{L}^2(P_X)$ depends on the data-generating distribution (this operator is well
defined: since the kernel is bounded, all functions in $\mathcal{H}$ are bounded and therefore square
integrable under any distribution $P_X$).
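
The relations between the empirical operators can be checked numerically in a toy setting where $\mathcal{H}$ is finite-dimensional. The sketch below assumes an explicit feature map with $k(x, x') = \langle \phi(x), \phi(x') \rangle$ and $\mathcal{H} = \mathbb{R}^p$; the matrix `Phi` and all variable names are illustrative assumptions and do not appear in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
Phi = rng.standard_normal((n, p))       # row i is the (assumed) feature vector phi(X_i)

def T_n(g):
    """Evaluation operator: g in H = R^p  ->  (g(X_1), ..., g(X_n)), with g(x) = <g, phi(x)>."""
    return Phi @ g

def T_n_star(u):
    """Empirical integral operator: u in R^n  ->  (1/n) sum_i u_i phi(X_i)."""
    return Phi.T @ u / n

K_n = Phi @ Phi.T / n                   # normalized kernel matrix
S_n = Phi.T @ Phi / n                   # empirical covariance operator on H

u, g = rng.standard_normal(n), rng.standard_normal(p)
lhs = T_n_star(u) @ g                   # <T_n^* u, g>_H
rhs = u @ T_n(g) / n                    # <u, T_n g> with the 1/n-rescaled Euclidean product
```

In this setting `lhs` and `rhs` agree up to rounding, `K_n @ u` equals `T_n(T_n_star(u))`, and `S_n @ g` equals `T_n_star(T_n(g))`.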

The following results, taken from [2] (Propositions 21 and 22), quantify more precisely
that the empirical covariance operator $S_n = T_n^* T_n$ and the empirical integral operator applied
to the data, $T_n^* Y$, are close to the population covariance operator $S = T^* T$ and to the kernel
integral operator applied to the noiseless target function, $T^* f^*$, respectively.

Proposition 4.1. Provided that condition (6) is true, the following holds:

\[ P\left[ \|S_n - S\|_{HS} \leq \frac{4\kappa}{\sqrt{n}} \sqrt{\log\frac{2}{\gamma}} \right] \geq 1 - \gamma, \tag{11} \]

where $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm. If the representation $f^* = T f^*_{\mathcal{H}}$ holds, and
under assumption (Bernstein), we have the following:

\[ P\left[ \|T_n^* Y - S f^*_{\mathcal{H}}\| \leq \frac{4M\sqrt{\kappa}}{\sqrt{n}} \log\frac{2}{\gamma} \right] \geq 1 - \gamma. \tag{12} \]

We note that $f^* = T f^*_{\mathcal{H}}$ implies that the target function $f^*$ coincides with a function $f^*_{\mathcal{H}}$
belonging to $\mathcal{H}$ (remember that $T$ is just the change-of-space operator). Hence, the second
result (12) is valid for the case with $r \geq 1/2$, but it is not true in general for $r < 1/2$.



4.1 Nemirovskii's result on conjugate gradient regularization rates 

We recall a sharp result due to Nemirovskii [T7] establishing convergence rates for conjugate
gradient methods in a deterministic context. We present the result in an abstract context,
then show how, combined with the previous section, it leads to a proof of Theorem 2.1.
Consider the linear equation

\[ A z^* = b, \]

where $A$ is a bounded linear operator over a Hilbert space $\mathcal{H}$. Assume that the above
equation has a solution and denote by $z^*$ its minimal norm solution; assume further that a
self-adjoint operator $\bar{A}$ and an element $\bar{b} \in \mathcal{H}$ are known such that

\[ \|A - \bar{A}\| \leq \delta; \quad \|b - \bar{b}\| \leq \varepsilon \tag{13} \]

(with $\delta$ and $\varepsilon$ known positive numbers). Consider the CG algorithm based on the noisy
operator $\bar{A}$ and data $\bar{b}$, giving the output at step $m$

\[ z_m = \arg\min_{z \in \mathcal{K}_m(\bar{A}, \bar{b})} \|\bar{A} z - \bar{b}\|^2. \tag{14} \]

The discrepancy principle stopping rule is defined as follows. Consider a fixed constant
$\tau > 1$ and define

\[ \hat{m} = \min\{ m \geq 0 : \|\bar{A} z_m - \bar{b}\| < \tau(\delta \|z_m\| + \varepsilon) \}. \]

We output the solution obtained at step $\max(0, \hat{m} - 1)$. Consider a minor variation of
this rule:

fh if <2Vn(0) < 

max(0,m — 1) otherwise, 



m 



where $q_{\hat{m}}$ is the degree $\hat{m} - 1$ polynomial such that $z_{\hat{m}} = q_{\hat{m}}(\bar{A}) \bar{b}$, and $\eta$ is an arbitrary
positive constant such that $\eta < 1/\tau$. Nemirovskii established the following theorem:

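Theorem 4.2. Assume that (a) $\max(\|A\|, \|\bar{A}\|) \leq L$; and that (b) $z^* = A^\mu u^*$ with $\|u^*\| \leq R$
for some $\mu > 0$. Then for any $\theta \in [0, 1]$, provided that $\tilde{m} < \infty$ it holds that

\[ \|A^\theta (z_{\tilde{m}} - z^*)\|^2 \leq c(\mu, \tau, \eta)\, R^{\frac{2(1-\theta)}{1+\mu}} (\varepsilon + \delta R L^\mu)^{\frac{2(\theta+\mu)}{1+\mu}}. \]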



4.2 Proof of Theorem 2.1

We apply Nemirovskii's result in our setting (assuming $r \geq \frac{1}{2}$): by identifying the approximate
operator and data as $\bar{A} = S_n$ and $\bar{b} = T_n^* Y$, we see that the CG algorithm considered
by Nemirovskii (14) is exactly (10), more precisely with the identification $z_m = f_m$.

For the population version, we identify $A = S$ and $z^* = f^*_{\mathcal{H}}$ (remember that provided
$r \geq \frac{1}{2}$ in the source condition, there exists $f^*_{\mathcal{H}} \in \mathcal{H}$ such that $f^* = T f^*_{\mathcal{H}}$).

Condition (a) of Nemirovskii's Theorem 4.2 is satisfied with $L = \kappa$ by the boundedness of
the kernel. Condition (b) is satisfied with $\mu = r - 1/2 \geq 0$ and $R = \kappa^{-r} \rho$, as implied by the
source condition SC(r). Finally, the concentration result of Proposition 4.1 ensures that the approximation
conditions (13) are satisfied with probability $1 - 2\gamma$, more precisely with $\delta$ and $\varepsilon$ given by the
right-hand sides of the bounds in (11) and (12), respectively. (Here we replaced $\gamma$ in (11) and (12) by $\gamma/2$,
so that the two conditions are satisfied simultaneously, by the union bound.) The operator norm is upper bounded
by the Hilbert-Schmidt norm, so that the deviation inequality for the operators is actually
stronger than what is needed.

We consider the discrepancy principle stopping rule associated to these parameters, the
choice $\eta = 1/(2\tau)$, and $\theta = \frac{1}{2}$, thus obtaining the result, since

\[ \|A^{\frac{1}{2}} (z_{\tilde{m}} - z^*)\|^2 = \|S^{\frac{1}{2}} (f_{\tilde{m}} - f^*_{\mathcal{H}})\|_{\mathcal{H}}^2 = \|f_{\tilde{m}} - f^*\|_2^2. \]

4.3 Notes on the proof of Theorems 2.2 and 2.3



The above proof shows that an application of Nemirovskii's fundamental result for CG
regularization of inverse problems under deterministic noise (on the data and the operator)
allows us to obtain our first result. One key ingredient is the concentration property of Proposition 4.1,
which allows us to bound deviations in a quasi-deterministic manner.

To prove the sharper results of Theorems 2.2 and 2.3, such a direct approach unfortunately does not
work, and a complete rework and extension of the proof is necessary. The proof
of Theorem 2.2 is presented in the supplementary material to the paper. In a nutshell, the
concentration result of Proposition 4.1 is too coarse to prove the optimal rates of convergence taking into
account the intrinsic dimension parameter. Instead of that result, we have to consider the
deviations from the mean in a "warped" norm, i.e. of the form $\|(S + \lambda I)^{-\frac{1}{2}}(T_n^* Y - T^* f^*)\|$
for the data, and $\|(S + \lambda I)^{-\frac{1}{2}}(S_n - S)\|_{HS}$ for the operator (with an appropriate choice of
$\lambda > 0$) respectively. Deviations of this form were introduced and used in [HI E] to obtain
sharp rates in the framework of Tikhonov's regularization (2) and of the more general linear
regularization schemes of the form (3). Bounds on deviations of this form can be obtained
via a Bernstein-type concentration inequality for Hilbert-space valued random variables.

On the one hand, the results concerning linear regularization schemes of the form (3)
do not apply to the nonlinear CG regularization. On the other hand, Nemirovskii's result
does not apply to deviations controlled in the warped norm. Moreover, the "outer" case
introduces additional technical difficulties. Therefore, the proofs for Theorems 2.2 and 2.3,
while still following the overall fundamental structure and ideas introduced by Nemirovskii,
are significantly different in that context. As mentioned above, we present the complete
proof of Theorem 2.2 in the supplementary material and a sketch of the proof of Theorem
2.3.



5 Conclusion 

In this work, we derived early stopping rules for kernel Conjugate Gradient regression that 
provide optimal learning rates to the true target function. Depending on the situation that 
we study, the rates are adaptive with respect to the regularity of the target function in some 
cases. The proofs of our results rely most importantly on ideas introduced by Nemirovskii 
[TT] and further developed by Hanke [13] for CG methods in the deterministic case, and
moreover on ideas inspired by [SI E].

Certainly, in practice, as for a large majority of learning algorithms, cross-validation 
remains the standard approach for model selection. The motivation of this work is however 
mainly theoretical, and our overall goal is to show that from the learning theoretical point 
of view, CG regularization stands on equal footing with other well-studied regularization 
methods such as kernel ridge regression or more general linear regularization methods (which
include, among many others, $L^2$-boosting). We also note that theoretically well-grounded
model selection rules can generally help cross-validation in practice by providing a well- 
calibrated parametrization of regularizer functions, or, as is the case here, of thresholds 
used in the stopping rule. 

One crucial property used in the proofs is that the proposed CG regularization schemes
can be conveniently cast in the reproducing kernel Hilbert space $\mathcal{H}$ as displayed in e.g. (10).
This reformulation is not possible for Kernel Partial Least Squares: it is also a CG-type
method, but uses the standard Euclidean norm instead of the $K_n$-norm used here. This point
is the main technical justification for why we focus on (5) rather than kernel PLS. Obtaining
optimal convergence rates also valid for Kernel PLS is an important future direction and
should build on the present work.

Another important direction for future efforts is the derivation of stopping rules that
do not depend on the confidence parameter $\gamma$. Currently, this dependence prevents us from
going from convergence in high probability to convergence in expectation, which would be
desirable. Perhaps more importantly, it would be of interest to find a stopping rule that is
adaptive to both parameters $r$ (target function regularity) and $s$ (intrinsic dimension
parameter) without their a priori knowledge. We recall that our first stopping rule is adaptive to $r$
but at the price of being worst-case in $s$. In the literature on linear regularization methods,
the optimal choice of regularization parameter is also non-adaptive, be it when considering
optimal rates with respect to $r$ only [2] or to both $r$ and $s$ [5]. An approach to alleviate this
problem is to use a hold-out sample for model selection; this was studied theoretically in [5]
for linear regularization methods (see also [5] for an account of the properties of hold-out
in a general setup). We strongly believe that the hold-out method will yield theoretically
founded adaptive model selection for CG as well. However, hold-out is typically regarded
as inelegant in that it requires throwing away part of the data for estimation. It would be
of more interest to study model selection methods that are based on using the whole data
in the estimation phase. The application of Lepskii's method is a possible step in this
direction.

A Supplementary Material

A.1 Notation

We follow the notation used in the main part, in particular the operators $T_n, T_n^*, T, T^*$
defined in Section 4.1, and we recall that $S_n := T_n^* T_n$; $S = T^* T$; $K_n = T_n T_n^*$; and $K = T T^*$.

We denote by $(\xi_i)_{i \geq 1}$ the possibly finite sequence in $[0, \kappa]$ of nonzero eigenvalues of $S$
and $K$, and by $(\xi_{i,n})_{1 \leq i \leq n}$ the $n$-sequence of eigenvalues of $S_n$ and $K_n$ respectively (in each
case in decreasing order and with multiplicity). Finally, $(F_u)_{u \geq 0}$ denotes the spectral family
of the operator $S_n$, i.e. $F_u$ is the orthogonal projector on the subspace of $\mathcal{H}$ spanned by
eigenvectors of $S_n$ corresponding to eigenvalues strictly less than $u$.

It is useful to consider the spectral integral representation: if $(e_{i,n})_{1 \leq i \leq n}$ denotes the
orthogonal eigensystem of $S_n$ associated to the nonzero eigenvalues $(\xi_{i,n})_{1 \leq i \leq n}$, for any
integrable function $h$ on $[0, \kappa]$, we set

\[ \int h(u) \, d\|F_u T_n^* Y\|^2 := \langle T_n^* Y, h(S_n) T_n^* Y \rangle = \sum_{i=1}^n h(\xi_{i,n}) \langle T_n^* Y, e_{i,n} \rangle^2. \]

By its definition, the output of the $m$-th iteration of the CG algorithm can be put under
the form $f_m = q_m(S_n) T_n^* Y$, where $q_m \in \mathcal{P}_{m-1}$, the set of real polynomials of degree at most
$m - 1$. A crucial role is played by the residual polynomial

\[ p_m(x) = 1 - x q_m(x) \in \mathcal{P}_m^0, \]

where $\mathcal{P}_m^0$ is the set of real polynomials of degree at most $m$ and having constant term
equal to 1. In particular $T_n^* Y - S_n f_m = p_m(S_n) T_n^* Y$. Furthermore, the definition of the
CG algorithm implies that the sequence $(p_m)_{m \geq 0}$ are orthogonal polynomials for the scalar
product $[\cdot, \cdot]_{(1)}$, where for $l \geq 0$ we define

\[ [p, q]_{(l)} := \langle p(S_n) T_n^* Y, S_n^l q(S_n) T_n^* Y \rangle = \int p(u) q(u) u^l \, d\|F_u T_n^* Y\|^2. \]

This can be shown as follows: $p_m$ is the orthogonal projection of the origin onto the affine
space $\mathcal{P}_m^0 = 1 + x\mathcal{P}_{m-1}$ for the scalar product $[\cdot, \cdot]_{(0)}$, where $x\mathcal{P}_{m-1}$ denotes (with some
abuse of notation) the set of polynomials of degree at most $m$ with constant coefficient equal
to zero. Thus $0 = [p_m, xq]_{(0)} = [p_m, q]_{(1)}$ for any $q \in \mathcal{P}_{m-1}$. From the theory of orthogonal
polynomials, it results that for any $m \leq m_{\mathrm{final}} := \#\{ i : 1 \leq i \leq n, \ \xi_{i,n} \langle T_n^* Y, e_{i,n} \rangle_{\mathcal{H}} \neq 0 \}$,
the polynomial $p_m$ has exactly $m$ distinct roots belonging to $[0, \kappa]$, which we denote by
$(x_{k,m})_{1 \leq k \leq m}$ (in increasing order). Finally, we use the notation $c(a, b)$ to denote a function
depending on the stated parameters only, and whose exact value can change from line to
line.



A.2 Preparation of the proof

We follow the general architecture of Nemirovskii's proof to establish rates. We recall that
since we assume $r \geq 1/2$, the representation $f^* = T f^*_{\mathcal{H}}$ holds. The main difference to
Nemirovskii's original result is that (similar to the approach of [HI E]) we use deviation
bounds in a "warped" norm rather than in the standard norm. More precisely, we consider
the following type of assumptions:

B1($\lambda$): $\|(S + \lambda I)^{-\frac{1}{2}}(T_n^* Y - S_n f^*_{\mathcal{H}})\| \leq \delta(\lambda)$,

B2($\lambda$): $\|(S + \lambda I)(S_n + \lambda I)^{-1}\| \leq \Gamma^2$, with $\Gamma \geq 1$
(this implies in particular $\|(S + \lambda I)^{\frac{1}{2}}(S_n + \lambda I)^{-\frac{1}{2}}\| \leq \Gamma$ via flMl) below),

B3: $\|S - S_n\| \leq \kappa\lambda$.

In the rest of this section we set $\mu = r - 1/2$. Under the source condition assumption SC(r), for $r \ge 1/2$ the representation $f^* = K^r u$ can be rewritten

$$f^* = (TT^*)^r u = T(T^*T)^{r-\frac12}(T^*T)^{-\frac12}T^*u = TS^{\mu}(T^*T)^{-\frac12}T^*u ;$$

by identification we therefore have the source condition for $f_{\mathcal H}^*$ given by $f_{\mathcal H}^* = S^\mu w$ with $w = (T^*T)^{-\frac12}T^*u$, and $\|w\|_{\mathcal H} \le \|u\|$, since $(T^*T)^{-\frac12}T^*$ is a restricted isometry from $L^2(P_X)$ into $\mathcal H$.

We define the shortcut notation

$$Z_\mu(\lambda) = \begin{cases} \lambda^\mu & \text{for } \mu < 1, \\ \kappa^\mu \Delta & \text{for } \mu \ge 1. \end{cases} \tag{15}$$

We start with preliminary technical lemmas, before turning to the proof of Theorem 2.2. 

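Lemma A.1. For any $\lambda > 0$, if assumptions SC(r), B1($\lambda$), B2($\lambda$) and B3 hold, then for any iteration step $1 \le m \le m_{\mathrm{final}}$

$$\|T_n^*(T_n f_m - Y)\| \le c(\mu)\Upsilon^2\big(|p_m'(0)|^{-(\mu+1)} + Z_\mu(\lambda)|p_m'(0)|^{-1}\big)\kappa^{-(\mu+\frac12)}\rho + \big(|p_m'(0)|^{-\frac12} + \lambda^{\frac12}\big)\Upsilon\delta(\lambda). \tag{16}$$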

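Proof. Recall that $(x_{k,m})_{1 \le k \le m}$ denote the $m$ roots of the polynomial $p_m$; define further the function $\varphi_m$ on the interval $[0, x_{1,m}]$ as

$$\varphi_m(x) = p_m(x)\Big(\frac{x_{1,m}}{x_{1,m}-x}\Big)^{\frac12}.$$

Following the idea introduced by Nemirovskii, it can be shown that

$$\|T_n^*(T_n f_m - Y)\| = \|p_m(S_n)T_n^*Y\| \le \|F_{x_{1,m}}\varphi_m(S_n)T_n^*Y\|$$
$$\le \|F_{x_{1,m}}\varphi_m(S_n)S_n f_{\mathcal H}^*\| + \|F_{x_{1,m}}\varphi_m(S_n)(T_n^*Y - S_n f_{\mathcal H}^*)\| := (I) + (II).$$

Above, the first inequality (Lemma 3.7 in [13]) is the crucial point, and relies fundamentally on the fact that $(p_m)$ is an orthogonal polynomial sequence.

We start with controlling the second term:

$$(II) = \|F_{x_{1,m}}\varphi_m(S_n)(T_n^*Y - S_n f_{\mathcal H}^*)\|$$
$$= \|F_{x_{1,m}}\varphi_m(S_n)(S+\lambda I)^{\frac12}(S+\lambda I)^{-\frac12}(T_n^*Y - S_n f_{\mathcal H}^*)\|$$
$$\le \|F_{x_{1,m}}\varphi_m(S_n)(S_n+\lambda I)^{\frac12}\|\,\Upsilon\delta(\lambda)$$
$$\le \Big(\sup_{x\in[0,x_{1,m}]} x^{\frac12}\varphi_m(x) + \lambda^{\frac12}\sup_{x\in[0,x_{1,m}]}\varphi_m(x)\Big)\Upsilon\delta(\lambda)$$
$$\le \big(|p_m'(0)|^{-\frac12} + \lambda^{\frac12}\big)\Upsilon\delta(\lambda),$$

where the last line used the inequality (see (3.10) in [13])

$$\sup_{x\in[0,x_{1,m}]} x^\nu\varphi_m^2(x) \le \nu^\nu |p_m'(0)|^{-\nu} \tag{17}$$

for any $\nu \ge 0$ (using the convention $0^0 = 1$), which we applied above for $\nu = 0, 1$. For the first term, we use assumption SC(r); first consider the case $\mu \ge 1$:

$$(I) = \|F_{x_{1,m}}\varphi_m(S_n)S_n f_{\mathcal H}^*\| = \|F_{x_{1,m}}\varphi_m(S_n)S_n S^\mu w\|$$
$$\le \big(\|F_{x_{1,m}}\varphi_m(S_n)S_n^{\mu+1}\| + \|F_{x_{1,m}}\varphi_m(S_n)S_n\|\,\|S^\mu - S_n^\mu\|\big)\kappa^{-(\mu+\frac12)}\rho$$
$$\le c(\mu)\big(|p_m'(0)|^{-(\mu+1)} + Z_\mu(\lambda)|p_m'(0)|^{-1}\big)\kappa^{-(\mu+\frac12)}\rho,$$

where we applied (17) with $\nu = 2(\mu+1)$, $\nu = 2$ and (33).
For the case $\mu < 1$, using (34) and arguments similar to the previous case:

$$(I) = \|F_{x_{1,m}}\varphi_m(S_n)S_n f_{\mathcal H}^*\| = \|F_{x_{1,m}}\varphi_m(S_n)S_n S^\mu w\|$$
$$\le \|F_{x_{1,m}}\varphi_m(S_n)S_n(S_n+\lambda I)^\mu\|\,\|(S_n+\lambda I)^{-\mu}(S+\lambda I)^\mu\|\,\|(S+\lambda I)^{-\mu}S^\mu\|\,\kappa^{-(\mu+\frac12)}\rho$$
$$\le c(\mu)\Upsilon^2\big(|p_m'(0)|^{-(\mu+1)} + \lambda^\mu|p_m'(0)|^{-1}\big)\kappa^{-(\mu+\frac12)}\rho. \qquad \square$$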
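Inequality (17) depends only on the fact that $p_m$ has $m$ positive real roots and satisfies $p_m(0)=1$, so it can be illustrated without running CG at all. The sketch below is an illustration under assumptions: the roots in `roots` are arbitrary hypothetical values, and the supremum is approximated on a finite grid of $[0, x_{1,m})$.

```python
import numpy as np

# Hypothetical roots x_{1,m} < ... < x_{m,m}; p_m(x) = prod_k (1 - x / x_k).
roots = np.array([0.3, 0.7, 1.1, 2.0])
dp0 = np.sum(1.0 / roots)                   # |p_m'(0)| = sum_k 1 / x_k
x = np.linspace(0.0, roots[0], 2001)[:-1]   # grid on [0, x_1), endpoint excluded


def phi_squared(x, roots):
    """phi_m(x)^2 = p_m(x)^2 * x_1 / (x_1 - x)."""
    x1 = roots[0]
    p = np.prod(1.0 - x[:, None] / roots[None, :], axis=1)
    return p**2 * x1 / (x1 - x)


def lhs(nu):
    """Grid approximation of sup_x x^nu * phi_m(x)^2."""
    return np.max(x**nu * phi_squared(x, roots))


def rhs(nu):
    """nu^nu * |p_m'(0)|^(-nu); 0.0**0.0 evaluates to 1.0 for floats."""
    return nu**nu * dp0**(-nu)


checks = {nu: (lhs(nu), rhs(nu)) for nu in (0.0, 1.0, 2.0, 4.5)}
```

For each value of `nu` the first entry of `checks[nu]` should not exceed the second, which is the content of (17) on this example.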

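Lemma A.2. For any $\lambda > 0$, if assumptions SC(r), B1($\lambda$), B2($\lambda$) and B3 hold, then for any iteration step $1 \le m \le m_{\mathrm{final}}$, for any $\varepsilon \in (0, x_{1,m})$:

$$\|T(f_m - f_{\mathcal H}^*)\| \le c\,\Upsilon^2\big(1 + \lambda(|p_m'(0)| + \varepsilon^{-1})\big)\delta(\lambda) + c(\mu)\Upsilon^2\big(\varepsilon^{\frac12} + \lambda^{\frac12}\big)\big(\varepsilon^\mu + Z_\mu(\lambda)\big)\kappa^{-(\mu+\frac12)}\rho$$
$$+ \Upsilon\Big(1 + \frac{\lambda^{\frac12}}{\varepsilon^{\frac12}}\Big)\varepsilon^{-\frac12}\|T_n^*(T_n f_m - Y)\|,$$

where $c$ is a numerical constant. If $m = 0$, the above inequality is valid for any $\varepsilon > 0$.

Proof. Set $\bar f_m = q_m(S_n)S_n f_{\mathcal H}^*$. This is the element in $\mathcal H$ that we obtain by applying the $m$th-iteration CG polynomial $q_m$ to the noiseless data. We have

$$\|T(f_m - f_{\mathcal H}^*)\| = \|S^{\frac12}(f_m - f_{\mathcal H}^*)\| \le \Upsilon\|(S_n+\lambda I)^{\frac12}(f_m - f_{\mathcal H}^*)\|$$
$$\le \Upsilon\Big(\|F_\varepsilon(S_n+\lambda I)^{\frac12}(f_m - \bar f_m)\| + \|F_\varepsilon(S_n+\lambda I)^{\frac12}(\bar f_m - f_{\mathcal H}^*)\| + \|F_\varepsilon^\perp(S_n+\lambda I)^{\frac12}(f_m - f_{\mathcal H}^*)\|\Big)$$
$$:= \Upsilon\big((I) + (II) + (III)\big),$$

where we denote $F_\varepsilon^\perp := (I - F_\varepsilon)$. First summand:

$$(I) = \|F_\varepsilon(S_n+\lambda I)^{\frac12}(f_m - \bar f_m)\|$$
$$= \|F_\varepsilon(S_n+\lambda I)^{\frac12}q_m(S_n)(S+\lambda I)^{\frac12}(S+\lambda I)^{-\frac12}(T_n^*Y - S_n f_{\mathcal H}^*)\|$$
$$\le \Upsilon\|F_\varepsilon(S_n+\lambda I)^{\frac12}q_m(S_n)(S_n+\lambda I)^{\frac12}\|\,\delta(\lambda)$$
$$\le \Upsilon\delta(\lambda)\Big(\sup_{x\in[0,\varepsilon]} x\,q_m(x) + \lambda\sup_{x\in[0,\varepsilon]} q_m(x)\Big)$$
$$\le \Upsilon\delta(\lambda)\big(1 + \lambda|p_m'(0)|\big).$$

The last inequality is obtained by the following argument: if $m \ge 1$, since $\varepsilon < x_{1,m}$ and $p_m$ is convex in $[0,\varepsilon]$, we have

$$q_m(x) = \frac{1 - p_m(x)}{x} \le |p_m'(0)| \quad \text{for } x \in [0,\varepsilon];$$

and also $x\,q_m(x) = 1 - p_m(x) \le 1$ for $x \in [0,\varepsilon]$. If $m = 0$, we have $p_0 = 1$ and $q_m = 0$, so that the above is also trivially satisfied for any $x$.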
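The convexity argument just used can be illustrated on a polynomial with prescribed positive roots. This is a minimal sketch with hypothetical root values; it checks on a grid of $(0, x_{1,m}]$ that $q_m(x) = (1 - p_m(x))/x$ stays below $|p_m'(0)|$ and that $x\,q_m(x)$ stays below 1.

```python
import numpy as np

# Hypothetical roots x_{1,m} < ... < x_{m,m} of p_m, with p_m(0) = 1.
roots = np.array([0.3, 0.7, 1.1, 2.0])
dp0 = np.sum(1.0 / roots)                   # |p_m'(0)|


def p(x):
    """p_m(x) = prod_k (1 - x / x_k)."""
    return np.prod(1.0 - x[:, None] / roots[None, :], axis=1)


x = np.linspace(0.0, roots[0], 2001)[1:]    # grid on (0, x_1]
q = (1.0 - p(x)) / x                        # q_m(x) = (1 - p_m(x)) / x

max_q = q.max()                             # compare with |p_m'(0)|
max_xq = (x * q).max()                      # compare with 1
```

On this example `max_q` is attained near $x = 0$, where $q_m(x)$ tends to $|p_m'(0)|$ from below, which shows that the bound $\sup q_m \le |p_m'(0)|$ cannot be improved in general.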

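Second summand: first subcase, $\mu \ge 1$, using (33) and the fact that $|p_m(x)| \le 1$ for $x \in [0,\varepsilon]$:

$$(II) = \|F_\varepsilon(S_n+\lambda I)^{\frac12}(\bar f_m - f_{\mathcal H}^*)\| = \|F_\varepsilon(S_n+\lambda I)^{\frac12}p_m(S_n)S^\mu w\|$$
$$\le \Big(\|F_\varepsilon(S_n+\lambda I)^{\frac12}p_m(S_n)S_n^\mu\| + \|F_\varepsilon(S_n+\lambda I)^{\frac12}p_m(S_n)\|\,c(\mu)\kappa^\mu\Delta\Big)\kappa^{-(\mu+\frac12)}\rho$$
$$\le \Big(\varepsilon^\mu\big(\varepsilon^{\frac12} + \lambda^{\frac12}\big) + c(\mu)\kappa^\mu\big(\varepsilon^{\frac12} + \lambda^{\frac12}\big)\Delta\Big)\kappa^{-(\mu+\frac12)}\rho.$$

Bounding the second summand: second subcase, $\mu < 1$:

$$(II) = \|F_\varepsilon(S_n+\lambda I)^{\frac12}p_m(S_n)S^\mu w\| \le \|F_\varepsilon(S_n+\lambda I)^{\mu+\frac12}p_m(S_n)\|\,\Upsilon^2\kappa^{-(\mu+\frac12)}\rho$$
$$\le c(\mu)(\varepsilon+\lambda)^{\mu+\frac12}\Upsilon^2\kappa^{-(\mu+\frac12)}\rho.$$

Third summand:

$$(III) = \|F_\varepsilon^\perp(S_n+\lambda I)^{\frac12}(f_m - f_{\mathcal H}^*)\|$$
$$\le \|F_\varepsilon^\perp S_n^{\frac12}(f_m - f_{\mathcal H}^*)\| + \lambda^{\frac12}\|F_\varepsilon^\perp(f_m - f_{\mathcal H}^*)\|$$
$$\le \Big(\frac{(\varepsilon+\lambda)^{\frac12}}{\varepsilon^{\frac12}} + \frac{\lambda^{\frac12}(\varepsilon+\lambda)^{\frac12}}{\varepsilon}\Big)\|F_\varepsilon^\perp(S_n+\lambda I)^{-\frac12}S_n(f_m - f_{\mathcal H}^*)\|$$
$$\le \Big(1 + \frac{\lambda^{\frac12}}{\varepsilon^{\frac12}}\Big)\Big(1 + \frac{\lambda}{\varepsilon}\Big)^{\frac12}\Big(\|F_\varepsilon^\perp(S_n+\lambda I)^{-\frac12}T_n^*(T_n f_m - Y)\| + \|(S_n+\lambda I)^{-\frac12}(T_n^*Y - S_n f_{\mathcal H}^*)\|\Big)$$
$$\le \Big(1 + \frac{\lambda^{\frac12}}{\varepsilon^{\frac12}}\Big)\varepsilon^{-\frac12}\|T_n^*(T_n f_m - Y)\| + \sqrt2\,\Upsilon\Big(1 + \frac{\lambda}{\varepsilon}\Big)\|(S+\lambda I)^{-\frac12}(T_n^*Y - S_n f_{\mathcal H}^*)\|$$
$$\le \Big(1 + \frac{\lambda^{\frac12}}{\varepsilon^{\frac12}}\Big)\varepsilon^{-\frac12}\|T_n^*(T_n f_m - Y)\| + \sqrt2\,\Upsilon\Big(1 + \frac{\lambda}{\varepsilon}\Big)\delta(\lambda). \qquad \square$$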



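We now consider the sequence of polynomials that are orthogonal with respect to the scalar product $[\cdot,\cdot]_{(2)}$, which we denote by $p_m^{(2)}$, and its roots by $x_{k,m}^{(2)}$.

Lemma A.3. For any $\lambda > 0$, if assumptions SC(r), B1($\lambda$), B2($\lambda$) and B3 hold, then for any iteration step $1 \le m \le m_{\mathrm{final}}$, for any $\varepsilon \in (0, x_{1,m-1})$:

$$[p_{m-1}, p_{m-1}]_{(0)}^{\frac12} = \|p_{m-1}(S_n)T_n^*Y\| \le \Upsilon(\varepsilon+\lambda)^{\frac12}\delta(\lambda) + c(\mu)\Upsilon^2\varepsilon\big(\varepsilon^\mu + Z_\mu(\lambda)\big)\kappa^{-(\mu+\frac12)}\rho + \varepsilon^{-\frac12}\big[p_{m-1}^{(2)}, p_{m-1}^{(2)}\big]_{(1)}^{\frac12}. \tag{18}$$

Proof. By the optimality property defining our CG algorithm,

$$\|p_{m-1}(S_n)T_n^*Y\| \le \|p_{m-1}^{(2)}(S_n)T_n^*Y\| \le \|F_\varepsilon\, p_{m-1}^{(2)}(S_n)T_n^*Y\| + \|F_\varepsilon^\perp p_{m-1}^{(2)}(S_n)T_n^*Y\|$$
$$\le \|F_\varepsilon T_n^*Y\| + \varepsilon^{-\frac12}\|p_{m-1}^{(2)}(S_n)S_n^{\frac12}T_n^*Y\| = \|F_\varepsilon T_n^*Y\| + \varepsilon^{-\frac12}\big[p_{m-1}^{(2)}, p_{m-1}^{(2)}\big]_{(1)}^{\frac12}.$$

For the last inequality, we have used the fact that $|p_{m-1}^{(2)}(x)| \le 1$ for $x \in [0, x_{1,m-1}^{(2)}]$, along with the assumption $0 < \varepsilon < x_{1,m-1} \le x_{1,m-1}^{(2)}$; the latter inequality is due to interlacing properties of the roots of orthogonal polynomials for $[\cdot,\cdot]_{(1)}$ and $[\cdot,\cdot]_{(2)}$ (see [13], Cor. 2.7).

We now bound

$$\|F_\varepsilon T_n^*Y\| \le \|F_\varepsilon(T_n^*Y - S_n f_{\mathcal H}^*)\| + \|F_\varepsilon S_n S^\mu w\|$$
$$\le \|F_\varepsilon(S_n+\lambda I)^{\frac12}\|\,\|(S_n+\lambda I)^{-\frac12}(T_n^*Y - S_n f_{\mathcal H}^*)\| + \|F_\varepsilon S_n S^\mu w\|$$
$$\le \Upsilon(\varepsilon+\lambda)^{\frac12}\delta(\lambda) + \|F_\varepsilon S_n S^\mu w\| ;$$

for the second term, we divide as usual into two cases: for $\mu \ge 1$:

$$\|F_\varepsilon S_n S^\mu w\| \le \|F_\varepsilon S_n^{\mu+1}w\| + \|F_\varepsilon S_n(S^\mu - S_n^\mu)w\| \le \varepsilon\,c(\mu)\big(\varepsilon^\mu + \kappa^\mu\Delta\big)\kappa^{-(\mu+\frac12)}\rho,$$

and for $\mu < 1$:

$$\|F_\varepsilon S_n S^\mu w\| \le \|F_\varepsilon S_n(S_n+\lambda I)^\mu\|\,\Upsilon^2\kappa^{-(\mu+\frac12)}\rho \le \varepsilon\big(\varepsilon^\mu + \lambda^\mu\big)\Upsilon^2\kappa^{-(\mu+\frac12)}\rho. \qquad \square$$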



A.3 Proof of Theorem 2.2

We fix

$$\lambda_* = \kappa\Big(\frac{4D}{\sqrt n}\log\frac{6}{\gamma}\Big)^{\frac{2}{2\mu+s+1}} \tag{19}$$

and assume $n$ is big enough to ensure $\lambda_* \le \kappa$. Furthermore we denote $\tilde\lambda_* = \kappa^{-1}\lambda_*$ (this normalization was introduced in [6]).

We rewrite equivalently the discrepancy stopping rule as follows: for some fixed $\tau > 0$,

$$\hat m := \min\big\{m \ge 0 : \|T_n^*(T_n f_m - Y)\| \le (2+\tau)\lambda_*^{\frac12}\delta(\lambda_*)\big\}, \tag{20}$$

where

$$\delta(\lambda_*) := \frac34 M\tilde\lambda_*^{\mu+\frac12}. \tag{21}$$

(Observe that the above $\tau > 0$ is deduced from the constant $\tau' > 3/2$ considered in the main part of the paper via $\tau = \frac43(\tau' - \frac32)$.)
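The stopping rule (20) can be sketched on a finite-dimensional surrogate as follows. This is an illustration under assumptions, not the implementation analysed here: `S` stands in for $S_n$, `b` for $T_n^*Y$, `threshold` for $(2+\tau)\lambda_*^{1/2}\delta(\lambda_*)$, the Euclidean norm replaces the norm of $\mathcal H$, and the function name is hypothetical. The recursion used is the standard conjugate residual iteration, which minimises $\|b - Sf\|$ over the Krylov space at each step.

```python
import numpy as np


def cg_with_discrepancy_stop(S, b, threshold, max_iter=None):
    """Conjugate residual iteration for S f = b (S symmetric positive
    semi-definite), stopped at the first m with ||b - S f_m|| <= threshold.
    Returns (m_hat, f, residual_norms)."""
    max_iter = len(b) if max_iter is None else max_iter
    f = np.zeros_like(b)
    r = b.copy()                 # residual b - S f_m
    p = r.copy()                 # search direction
    Sr = S @ r
    Sp = Sr.copy()
    norms = [np.linalg.norm(r)]
    m = 0
    while norms[-1] > threshold and m < max_iter:
        rSr = r @ Sr
        SpSp = Sp @ Sp
        if rSr <= 0.0 or SpSp == 0.0:
            break                # breakdown: Krylov space exhausted
        alpha = rSr / SpSp
        f = f + alpha * p
        r = r - alpha * Sp
        Sr = S @ r
        beta = (r @ Sr) / rSr
        p = r + beta * p
        Sp = Sr + beta * Sp
        m += 1
        norms.append(np.linalg.norm(r))
    return m, f, norms


rng = np.random.default_rng(1)
d = 8
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
S = Q @ np.diag(np.linspace(0.1, 1.0, d)) @ Q.T
b = rng.standard_normal(d)
threshold = 0.2 * np.linalg.norm(b)   # hypothetical threshold value
m_hat, f_hat, norms = cg_with_discrepancy_stop(S, b, threshold)
```

The list `norms` records the residual norm at every iteration, so the stopping index `m_hat` is by construction the first index at which the residual falls below the threshold.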

We first check that B1($\lambda_*$), B2($\lambda_*$) and B3 are satisfied simultaneously with large probability, using for this the concentration results recalled in Section A.5. Concerning B1($\lambda_*$), inequality (31) ensures that with probability $1-\gamma$, we have



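$$\|(S+\lambda_* I)^{-\frac12}(T_n^*Y - S_n f_{\mathcal H}^*)\| \le 2M\Big(\sqrt{\frac{\mathcal N(\lambda_*)}{n}} + \frac{2\sqrt\kappa}{\sqrt{\lambda_*}\,n}\Big)\log\frac6\gamma$$
$$\le 2M\Big(\frac{D\tilde\lambda_*^{-\frac s2}}{\sqrt n} + \frac{2}{\tilde\lambda_*^{\frac12}n}\Big)\log\frac6\gamma$$
$$\le \frac{M}{2}\tilde\lambda_*^{\mu+\frac12}\Big(1 + \frac{1}{2D^2}\tilde\lambda_*^{\mu+s}\Big)$$
$$\le \frac34 M\tilde\lambda_*^{\mu+\frac12} = \delta(\lambda_*), \tag{22}$$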



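where we have used SC(r), (19) and the assumptions $D \ge 1$ and $\tilde\lambda_* \le 1$. We now turn to B2($\lambda_*$). Inequality (32) along with a repetition of the above reasoning yields that with probability $1-\gamma$:

$$\|(S+\lambda_* I)^{-\frac12}(S_n - S)\|_{HS} \le \frac34\kappa^{\frac12}\tilde\lambda_*^{\mu+\frac12}.$$

Observe that

$$\|(S+\lambda_* I)^{-\frac12}(S_n - S)(S+\lambda_* I)^{-\frac12}\| \le \lambda_*^{-\frac12}\|(S+\lambda_* I)^{-\frac12}(S_n - S)\|_{HS} \le \frac34\tilde\lambda_*^{\mu} \le \frac34, \tag{23}$$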



so that with Lemma A.5 we obtain that B2($\lambda_*$) is satisfied with $\Upsilon := 2$ (with probability $1-\gamma$). Finally, equation (11) in the main paper implies that (B3) is also satisfied with probability $1-\gamma$, with

$$\Delta := \frac{4}{\sqrt n}\log\frac6\gamma. \tag{24}$$

To conclude, by the union bound, the event that B1($\lambda_*$), B2($\lambda_*$) and B3 are satisfied simultaneously has probability larger than $1 - 3\gamma$, and we assume for the rest of the proof that we are on this event.

We will assume $\hat m \ge 1$ for the remainder of the proof and postpone to the end the (simpler) case $\hat m = 0$.

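First step: upper bound on $|p'_{\hat m-1}(0)|$. By definition of the stopping rule we have $\|T_n^*(T_n f_{\hat m-1} - Y)\| > (2+\tau)\lambda_*^{\frac12}\delta(\lambda_*)$. Now applying this together with the upper bound of Lemma A.1 (with $\Upsilon = 2$) we get

$$\tau\lambda_*^{\frac12}\delta(\lambda_*) < c(\mu)\big(|p'_{\hat m-1}(0)|^{-(\mu+1)} + Z_\mu(\lambda_*)|p'_{\hat m-1}(0)|^{-1}\big)\kappa^{-(\mu+\frac12)}\rho + 2|p'_{\hat m-1}(0)|^{-\frac12}\delta(\lambda_*)$$
$$\le 3\max\Big(2|p'_{\hat m-1}(0)|^{-\frac12}\delta(\lambda_*),\ c(\mu)\rho\kappa^{-(\mu+\frac12)}|p'_{\hat m-1}(0)|^{-(\mu+1)},\ c(\mu)\rho\kappa^{-(\mu+\frac12)}Z_\mu(\lambda_*)|p'_{\hat m-1}(0)|^{-1}\Big).$$

We examine in succession the possibility that the maximum in the above expression is attained for each of the terms which comprise it. If the first term attains the maximum, this implies $|p'_{\hat m-1}(0)| \le (36/\tau^2)\lambda_*^{-1}$. If the second term attains the maximum, this entails

$$c(\mu)\rho\kappa^{-(\mu+\frac12)}|p'_{\hat m-1}(0)|^{-(\mu+1)} \ge \tau\lambda_*^{\frac12}\delta(\lambda_*),$$

which using (21) yields:

$$|p'_{\hat m-1}(0)| \le c(\mu,\tau)\Big(\frac{\rho}{M}\Big)^{\frac{1}{\mu+1}}\lambda_*^{-1}.$$

Finally, if the third term attains the maximum, we have

$$c(\mu)\rho\kappa^{-(\mu+\frac12)}Z_\mu(\lambda_*)|p'_{\hat m-1}(0)|^{-1} \ge \tau\lambda_*^{\frac12}\delta(\lambda_*),$$

which using (21) yields:

$$|p'_{\hat m-1}(0)| \le c(\mu,\tau)\frac{\rho}{M}\lambda_*^{-(\mu+1)}Z_\mu(\lambda_*).$$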



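We now establish the inequality

$$Z_\mu(\lambda_*) \le \lambda_*^\mu. \tag{25}$$

The inequality is trivial if $\mu < 1$ given the definition of $Z_\mu(\lambda_*)$ in (15). If $\mu \ge 1$ holds, from the definition (24), it holds that $\Delta \le \tilde\lambda_*^{\frac{2\mu+s+1}{2}}$, hence

$$Z_\mu(\lambda_*)\lambda_*^{-\mu} = \Delta\tilde\lambda_*^{-\mu} \le \tilde\lambda_*^{\frac{s+1}{2}} \le 1.$$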
Gathering all three cases, we obtain that it always holds that

$$|p'_{\hat m-1}(0)| \le c(\mu,\tau)\max\Big(\frac{\rho}{M}, 1\Big)\lambda_*^{-1}. \tag{26}$$

Second step: upper bound on $|p'_{\hat m}(0)|$. For this we use the result of the first step and relate $|p'_{\hat m-1}(0)|$ to $|p'_{\hat m}(0)|$. It is a property of orthogonal polynomials (see Hanke, Corollary 2.6) that for any $m \ge 1$

$$p'_{m-1}(0) - p'_m(0) = \frac{[p_{m-1},p_{m-1}]_{(0)} - [p_m,p_m]_{(0)}}{\big[p_{m-1}^{(2)},p_{m-1}^{(2)}\big]_{(1)}} \le \frac{[p_{m-1},p_{m-1}]_{(0)}}{\big[p_{m-1}^{(2)},p_{m-1}^{(2)}\big]_{(1)}}. \tag{27}$$

To upper bound the above quantity, we apply Lemma A.3 with the choice $\lambda = \lambda_*$ and

$$\varepsilon = \varepsilon_* := a(\mu,\tau)\min\Big(\frac{M}{\rho}, 1\Big)\lambda_*,$$

where $0 < a(\mu,\tau) \le 1$ should be chosen small enough in order to satisfy some constraints to be specified below. The first constraint is the requirement $\varepsilon_* \in (0, x_{1,\hat m-1})$ in order to apply Lemma A.3. For this, it can be seen from (26) that $a(\mu,\tau)$ can be chosen small enough to ensure

$$0 < \varepsilon_* \le |p'_{\hat m-1}(0)|^{-1} \le x_{1,\hat m-1};$$

the last inequality is an easy consequence of the fact that $p_{\hat m-1}$ has exactly $(\hat m-1)$ positive real roots and $p_{\hat m-1}(0) = 1$. We now turn to upper bound the following quantity appearing on the RHS of (18):

$$\Upsilon(\varepsilon_*+\lambda_*)^{\frac12}\delta(\lambda_*) + c(\mu)\Upsilon^2\varepsilon_*\big(\varepsilon_*^\mu + Z_\mu(\lambda_*)\big)\kappa^{-(\mu+\frac12)}\rho$$
$$\le 2(a(\mu,\tau)+1)\lambda_*^{\frac12}\delta(\lambda_*) + c(\mu)a(\mu,\tau)\min(\rho,M)\lambda_*^{\mu+1}\kappa^{-(\mu+\frac12)} \tag{28}$$
$$\le \big(c(\mu)a(\mu,\tau)+2\big)\lambda_*^{\frac12}\delta(\lambda_*),$$

where we have used the definition (21) for $\delta(\lambda_*)$ and inequality $Z_\mu(\lambda_*) \le \lambda_*^\mu$, see (25). Now, we choose $a(\mu,\tau)$ so that the factor in the last display satisfies $c(\mu)a(\mu,\tau) \le \tau/2$. Remember that the definition of the stopping rule entails

$$[p_{\hat m-1},p_{\hat m-1}]_{(0)}^{\frac12} = \|T_n^*(T_n f_{\hat m-1} - Y)\| > (2+\tau)\lambda_*^{\frac12}\delta(\lambda_*). \tag{29}$$

Now combining (18), (28) and (29), we obtain

$$\Big(1 - \frac{4+\tau}{4+2\tau}\Big)[p_{\hat m-1},p_{\hat m-1}]_{(0)}^{\frac12} \le \varepsilon_*^{-\frac12}\big[p_{\hat m-1}^{(2)},p_{\hat m-1}^{(2)}\big]_{(1)}^{\frac12};$$

using this inequality in relation with (27) and (26), we obtain

$$|p'_{\hat m}(0)| \le |p'_{\hat m-1}(0)| + c(\tau)\varepsilon_*^{-1} \le c(\mu,\tau)\max\Big(\frac{\rho}{M}, 1\Big)\lambda_*^{-1}.$$

Final step. We apply Lemma A.2 (with $\lambda = \lambda_*$ and $\varepsilon = \varepsilon_*$), together with the bound on $|p'_{\hat m}(0)|$ just obtained, and the inequality (by definition of the stopping rule)

$$\|T_n^*(T_n f_{\hat m} - Y)\| \le (2+\tau)\lambda_*^{\frac12}\delta(\lambda_*),$$

obtaining, using again (25):

$$\|f_{\hat m} - f^*\|_2 = \|T(f_{\hat m} - f_{\mathcal H}^*)\| \le c(\mu,\tau)\Big(\delta(\lambda_*)\max\Big(\frac{\rho}{M},1\Big) + \min(\rho,M)\tilde\lambda_*^{\mu+\frac12}\Big) \le c(\mu,\tau)(M+\rho)\tilde\lambda_*^{\mu+\frac12}.$$

If $\hat m = 0$, we can apply directly Lemma A.2 as above without requiring the two previous steps, since in this case $p_0'(0) = 0$, so that we obtain the same final bound.

A.4 Sketch of the proof of Theorem 2.3

For the proof of Theorem 2.3, the condition B1($\lambda$) is replaced by

B1'($\lambda$)  $\|(S+\lambda I)^{-\frac12}(T_n^*Y - T^*f^*)\| \le \delta(\lambda)$.

We check that B1'($\lambda_*$), B2($\lambda_*$) and B3 are satisfied in the setting of Theorem 2.3. To check B1'($\lambda_*$), we use (30) instead of (31). Since the easily checked relation $T_n^*Y = T_{\tilde n}^*\tilde Y$ holds, the upper bound obtained here has the same form as for Theorem 2.2, therefore we can use the same value $\delta(\lambda_*)$ for condition B1'($\lambda_*$) as in the previous section, given by (22). Notice however that we must now use the condition $\mu + s = r + s - \frac12 \ge 0$ to ensure that the chain of inequalities leading to (22) is valid.

For condition B2($\lambda_*$), we can apply the deviation inequality (32) but with $n$ replaced by $\tilde n$, since we make use of all the unlabeled data. Using the fact that $n/\tilde n \le \tilde\lambda_*^{(1-2r)_+}$ and some elementary algebra leads to B2($\lambda_*$) being satisfied with $\Upsilon := 2$.

Finally condition B3 is satisfied with $\Delta$ given by (24) with $n$ replaced by $\tilde n$.

Once these conditions are established, intermediate results similar in structure to Lemmas A.1, A.2 and A.3 can be derived, but where B1($\lambda$) is replaced by B1'($\lambda$). The details are omitted here.



A.5 More technical lemmas

In this section we collect some technical lemmas which underpin the main results. These are taken from previous sources and are recalled here for completeness. The main statistical tool is the following deviation inequality:

Lemma A.4. Let $\lambda$ be a positive number. Under assumption (Bounded), the following holds:


P 



(S + \I)-*(T*Y -T*f* 



< 2M 



A/"(A) 



l 6 
log- 

7 



> 1-7. 



(30) 



If the representation $f^* = Tf_{\mathcal H}^*$ holds and under assumption (Bernstein), we have the following:

$$P\bigg[\|(S+\lambda I)^{-\frac12}(T_n^*Y - S_n f_{\mathcal H}^*)\| \le 2M\Big(\sqrt{\frac{\mathcal N(\lambda)}{n}} + \frac{2\sqrt\kappa}{\sqrt\lambda\,n}\Big)\log\frac6\gamma\bigg] \ge 1-\gamma. \tag{31}$$

Finally, the following holds:

$$P\bigg[\|(S+\lambda I)^{-\frac12}(S_n - S)\|_{HS} \le 2\sqrt\kappa\Big(\sqrt{\frac{\mathcal N(\lambda)}{n}} + \frac{2\sqrt\kappa}{\sqrt\lambda\,n}\Big)\log\frac6\gamma\bigg] \ge 1-\gamma, \tag{32}$$

where we recall that $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm.

The proof can be found in [7], and is based on a Bernstein-type inequality for random variables taking values in a Hilbert space, as established in [20, 25].
Inequality (32) can be fruitfully combined with the following:

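Lemma A.5. Assume there exists $\eta > 0$ such that the following inequality holds:

$$\|(S+\lambda)^{-\frac12}(S_n - S)(S+\lambda)^{-\frac12}\| \le 1-\eta,$$

then

$$\|(S+\lambda)^{\frac12}(S_n+\lambda)^{-\frac12}\| \le \frac{1}{\sqrt\eta}.$$

Proof. First we have

$$\|(S+\lambda)^{\frac12}(S_n+\lambda)^{-\frac12}\|^2 = \|(S+\lambda)^{\frac12}(S_n+\lambda)^{-1}(S+\lambda)^{\frac12}\|,$$

then simple algebraic manipulation shows

$$(S+\lambda)^{\frac12}(S_n+\lambda)^{-1}(S+\lambda)^{\frac12} = \big(I - (S+\lambda)^{-\frac12}(S - S_n)(S+\lambda)^{-\frac12}\big)^{-1}.$$

Finally, using the inequality $\|(I-A)^{-1}\| = \|\sum_{k\ge0}A^k\| \le (1-\|A\|)^{-1}$ for $\|A\| < 1$ yields the conclusion. $\square$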
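The Neumann series argument of Lemma A.5 can be illustrated on matrices. In the sketch below (an illustration with hypothetical names and parameter values), `S` is a diagonal covariance matrix, `S_n` is the empirical second-moment matrix of `n` Gaussian samples, and `eta` is the largest value for which the hypothesis of the lemma holds on this sample.

```python
import numpy as np


def sym_power(A, t):
    """A**t for a symmetric positive definite matrix A, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * w**t) @ V.T


rng = np.random.default_rng(0)
d, n, lam = 5, 2000, 0.1
S = np.diag(2.0 ** -np.arange(d))             # hypothetical population operator
X = rng.standard_normal((n, d)) @ sym_power(S, 0.5)
S_n = X.T @ X / n                             # empirical counterpart of S

I = np.eye(d)
A = sym_power(S + lam * I, -0.5) @ (S - S_n) @ sym_power(S + lam * I, -0.5)
eta = 1.0 - np.linalg.norm(A, 2)              # hypothesis holds with this eta
lhs = np.linalg.norm(
    sym_power(S + lam * I, 0.5) @ sym_power(S_n + lam * I, -0.5), 2)
rhs = eta ** -0.5                             # bound 1 / sqrt(eta)
```

Here `lhs` should not exceed `rhs`, and the gap shrinks as `S_n` moves away from `S` in the direction that maximises the norm of `A`.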



We make use of the following operator inequalities: 



Lemma A.6. Let $A, B$ be two positive, self-adjoint operators with $\max(\|A\|, \|B\|) \le C$. Then for any $r > 0$, putting $\zeta = (r-1)_+$, the following inequality holds:

$$\|A^r - B^r\| \le (\zeta+1)C^\zeta\|A - B\|^{r-\zeta}. \tag{33}$$

Proof. Follows from the fact that the power function $x \mapsto x^r$ is operator monotone for $r \le 1$ and Lipschitz with constant $rC^{r-1}$ on $[0,C]$ if $r \ge 1$. $\square$

Lemma A.7 ([1], Theorem IX.2.1-2). Let $A, B$ be two self-adjoint, positive operators. Then for any $s \in [0,1]$:

$$\|A^s B^s\| \le \|AB\|^s. \tag{34}$$

Note: this result is stated for positive matrices in [1], but it is easy to check that the proof applies as well to positive operators on a Hilbert space.
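Inequality (34) can be spot-checked on random positive semi-definite matrices. The sketch below is only a numerical illustration with hypothetical helper names; it records the smallest observed value of $\|AB\|^s - \|A^sB^s\|$ over a few hundred random pairs and several exponents in $[0,1]$.

```python
import numpy as np


def sym_power(A, t):
    """A**t for a symmetric positive semi-definite matrix A (t >= 0)."""
    w, V = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)     # guard against tiny negative rounding errors
    return (V * w**t) @ V.T


rng = np.random.default_rng(0)


def random_psd(d):
    """Random symmetric positive semi-definite d x d matrix."""
    G = rng.standard_normal((d, d))
    return G @ G.T / d


worst_gap = np.inf
for _ in range(200):
    A, B = random_psd(4), random_psd(4)
    for s in (0.0, 0.25, 0.5, 0.75, 1.0):
        lhs = np.linalg.norm(sym_power(A, s) @ sym_power(B, s), 2)
        rhs = np.linalg.norm(A @ B, 2) ** s
        worst_gap = min(worst_gap, rhs - lhs)
```

The value `worst_gap` should be nonnegative up to rounding error; equality occurs at the endpoints $s = 0$ and $s = 1$.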